{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "nbgrader": {
     "grade": false,
     "grade_id": "cell-c833698b7dad927d",
     "locked": true,
     "schema_version": 3,
     "solution": false,
     "task": false
    }
   },
   "source": [
    "# Exercise 6: Multi-Step Bootstrapping"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbgrader": {
     "grade": false,
     "grade_id": "cell-7cf627dacfec200a",
     "locked": true,
     "schema_version": 3,
     "solution": false,
     "task": false
    }
   },
   "source": [
    "In this exercise we will have a look at n-step methods and eligibility trace. The n-step methods are a class of reinforcement learning algorithms that are an abstraction of the Monte Carlo and TD(0) methods discussed earlier and include them as special cases. Furthermore, we also consider the eligibility traces, which take a reverse approach to determining the state values. The environment we will be dealing with is a little more typical for control research: the inverted pendulum. \n",
    "\n",
    "![](https://miro.medium.com/max/1000/1*TNo3x9zDi1lVOH_3ncG7Aw.gif)\n",
    "\n",
    "To implement this environment, we will make use of the gymnasium library. Please install the gymnasium library within your preferred Python environment using:\n",
    "\n",
    "```pip install gymnasium```\n",
    "\n",
    "**Note: Use ```done = terminated or truncated``` for the end of the episode**\n",
    "\n",
    "Note that the episodes in this environment end with a truncation and not a termination. For this exercise, we assume that for the environment truncation and termination is the same. That is wherever, we are looking for a terminal state we instead look for the state where the episode is truncated! (Generally, termination and truncation are different things. This is the case because the timelimit is not actually part of the MDP but rather a constraint set from the outside. More on this [here](https://gymnasium.farama.org/tutorials/gymnasium_basics/handling_time_limits/))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "nbgrader": {
     "grade": false,
     "grade_id": "cell-68ba542456c544d2",
     "locked": true,
     "schema_version": 3,
     "solution": false,
     "task": false
    }
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import gymnasium as gym\n",
    "from gymnasium.wrappers import TimeLimit\n",
    "from tqdm.notebook import tqdm\n",
    "import matplotlib.pyplot as plt\n",
    "plt.style.use('seaborn-v0_8')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbgrader": {
     "grade": false,
     "grade_id": "cell-b9853bfaec1d8013",
     "locked": true,
     "schema_version": 3,
     "solution": false,
     "task": false
    }
   },
   "source": [
    "Check if the installation and import work by executing the following cell. A window with an animation of the pendulum should open, display some random actions, and close automatically."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "nbgrader": {
     "grade": false,
     "grade_id": "cell-8e133fbe2615fd5b",
     "locked": true,
     "schema_version": 3,
     "solution": false,
     "task": false
    }
   },
   "outputs": [],
   "source": [
    "env = gym.make('Pendulum-v1', render_mode=\"human\")\n",
    "env = env.unwrapped  # removes a built-in time limit of k_T = 200, we want to set the time limit ourselves\n",
    "env = TimeLimit(env, max_episode_steps = 300)\n",
    "\n",
    "state, _ = env.reset()\n",
    "for _ in range(300):\n",
    "    env.render()\n",
    "    state, reward, terminated, truncated, _ = env.step(env.action_space.sample()) # take a random action\n",
    "    done = terminated or truncated\n",
    "env.close()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbgrader": {
     "grade": false,
     "grade_id": "cell-f29246bdc3e421c0",
     "locked": true,
     "schema_version": 3,
     "solution": false,
     "task": false
    }
   },
   "source": [
    "The goal of this environment is to bring the pendulum into the upper neutral position, where the angle $\\theta = 0$ and the angular velocitiy $\\frac{\\text{d}}{\\text{d}t}\\theta=\\omega=0$. The reward function is already designed that way and does not need further specification. For further information about the environment you may refer to the code and documentation of Farama Foundation's `gymnasium`:\n",
    "\n",
    "[Documentation of the gymnasium pendulum](https://gymnasium.farama.org/environments/classic_control/pendulum/)\n",
    "\n",
    "[Pendulum environment in the gymnasium Github repository](https://github.com/Farama-Foundation/Gymnasium/blob/main/gymnasium/envs/classic_control/pendulum.py)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbgrader": {
     "grade": false,
     "grade_id": "cell-8570b84cc28ffc4d",
     "locked": true,
     "schema_version": 3,
     "solution": false,
     "task": false
    }
   },
   "source": [
    "## Discretization of Action and State Space"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbgrader": {
     "grade": false,
     "grade_id": "cell-ee59383187979c94",
     "locked": true,
     "schema_version": 3,
     "solution": false,
     "task": false
    }
   },
   "source": [
    "Unlike the racetrack environment, the inverted pendulum comes with a continuous action and state space. Although it is possible to handle systems with these characteristics, we did not yet learn how to deal with them. For now, we only know how to implement agents for discrete action and state spaces. As a result, we will also try to represent the inverted pendulum within a discrete state / action space through discretization.\n",
    "\n",
    "The pendulum has three state variables relating to the momentary angular position $\\theta$:\n",
    "$$\n",
    "\\begin{align*}\n",
    "    x=\\begin{bmatrix}\n",
    "    \\text{cos}(\\theta)\\\\\n",
    "    \\text{sin}(\\theta)\\\\\n",
    "    \\frac{\\text{d}}{\\text{d}t}\\theta\n",
    "    \\end{bmatrix}\n",
    "    \\in\n",
    "    \\begin{bmatrix}\n",
    "    [-1, 1]\\\\\n",
    "    [-1, 1]\\\\\n",
    "    [-8 \\, \\frac{1}{\\text{s}}, 8 \\, \\frac{1}{\\text{s}}]\n",
    "    \\end{bmatrix},\n",
    "\\end{align*}\n",
    "$$\n",
    "and one input variable which relates to the torque applied at the axis of rotation:\n",
    "\n",
    "$$\n",
    "\\begin{align*}\n",
    "    u = T \\in [-2 \\, \\text{N}\\cdot\\text{m}, 2 \\, \\text{N}\\cdot\\text{m}]\n",
    "\\end{align*}\n",
    "$$\n",
    "\n",
    "After the discretization, we want the system to be defined on sets of non-negative natural numbers:\n",
    "\n",
    "\n",
    "$$\n",
    "\\begin{align*}\n",
    "    x_d =\n",
    "    \\text{discretize\\_state}(x)\n",
    "    \\in\n",
    "    \\begin{bmatrix}\n",
    "    \\{0,1,2,...,d_{\\theta}-1\\}\\\\\n",
    "    \\{0,1,2,...,d_{\\theta}-1\\}\\\\\n",
    "    \\{0,1,2,...,d_{\\omega}-1\\}\n",
    "    \\end{bmatrix},\n",
    "\\end{align*}\n",
    "$$\n",
    "\n",
    "$$\n",
    "\\begin{align*}\n",
    "    u_d =\n",
    "    \\text{discretize\\_action}(u)\n",
    "    \\in\n",
    "    \\{0,1,2,...,d_{T}-1\\}.\n",
    "\\end{align*}\n",
    "$$\n",
    "\n",
    "Since action is selected within the discrete action space, we need to transform it accordingly:\n",
    "\n",
    "$$\n",
    "\\begin{align*}\n",
    "    u=\n",
    "    \\text{continualize\\_action}(u_d):\n",
    "    \\{0,1,2,...,d_{T}-1\\} \\rightarrow [-2 \\, \\text{N}\\cdot\\text{m}, 2 \\, \\text{N}\\cdot\\text{m}]\n",
    "    .\n",
    "\\end{align*}\n",
    "$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbgrader": {
     "grade": false,
     "grade_id": "cell-28b6b992373b4a65",
     "locked": true,
     "schema_version": 3,
     "solution": false,
     "task": false
    }
   },
   "source": [
    "Write the functions `discretize_state` and `continualize_action`, such that a discrete RL agent can be applied. (Please note that all I/O of `gymnasium` consists of numpy arrays.) Write the functions in such a way that the number of discretization intervals $d_\\theta, d_\\omega, d_T$ are parameters that can be changed for different tests. The discretization intervals should be uniformly distributed on their respective state space.\n",
    "\n",
    "A parametrization of $d_\\theta = d_\\omega = d_T = 15$ can be used to yield satisfactory results in this exercise.\n",
    "However, does it make a difference if the number of discretization intervals is odd or even? If yes, what should be preferred for the given environment? "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbgrader": {
     "grade": true,
     "grade_id": "cell-cf67ba4807c7ce8c",
     "locked": false,
     "points": 0,
     "schema_version": 3,
     "solution": true,
     "task": false
    }
   },
   "source": [
    "Code given below.\n",
    "\n",
    "The number of discretization intervals can in fact make a difference in this case. The inverted pendulum can be considered as solved when brought into the upper neutral position $\\theta=0, \\omega=0$. The state as given by `gymnasium` would therefore be:\n",
    "\n",
    "\\begin{align*}\n",
    "x_\\text{neutral}=\n",
    "\\begin{bmatrix}\n",
    "\\text{cos}(0)\\\\\n",
    "\\text{sin}(0)\\\\\n",
    "0\n",
    "\\end{bmatrix}\n",
    "=\n",
    "\\begin{bmatrix}\n",
    "1\\\\\n",
    "0\\\\\n",
    "0\n",
    "\\end{bmatrix},\n",
    "u_\\text{neutral}=0\n",
    "\\end{align*}\n",
    "\n",
    "Consequently, the discretization / continualization should allow for precise transformation of this state, which is given when assuming an odd number of discretization intervals. If one uses an even number of intervals, one interval boundary will be located exactly at zero, potentially leading to rapid bouncing around the neutral position."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "nbgrader": {
     "grade": false,
     "grade_id": "cell-af38d4d166803785",
     "locked": false,
     "schema_version": 3,
     "solution": true,
     "task": false
    }
   },
   "outputs": [],
   "source": [
    "d_theta = 15\n",
    "d_omega = 15\n",
    "d_T = 15\n",
    "\n",
    "def discretize_state(states):\n",
    "    limits = [1, 1, 8]\n",
    "    nb_disc_intervals = [d_theta, d_theta, d_omega]\n",
    "\n",
    "    # bring to value range [-1, 1]\n",
    "    norm_states = [state / limit for state, limit in zip(states, limits)]\n",
    "    interval_lengths = [2 / d for d in nb_disc_intervals]\n",
    "    disc_state = [(norm_state + 1) // interval_length for norm_state,\n",
    "                  interval_length in zip(norm_states, interval_lengths)]\n",
    "    disc_state = [(state - 1) if state == d else state for state,\n",
    "                  d in zip(disc_state, nb_disc_intervals)]  # ensure that disc_state < d\n",
    "\n",
    "    return np.array(disc_state, dtype=int)\n",
    "\n",
    "\n",
    "def continualize_action(disc_action):\n",
    "    limit = 2\n",
    "    interval_length = 2 / (d_T-1)\n",
    "    norm_action = disc_action * interval_length\n",
    "    cont_action = (norm_action - 1) * limit\n",
    "    return np.array(cont_action).flatten()\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbgrader": {
     "grade": false,
     "grade_id": "cell-54296429c6a25f98",
     "locked": true,
     "schema_version": 3,
     "solution": false,
     "task": false
    }
   },
   "source": [
    "Use the following cell for debugging:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "nbgrader": {
     "grade": false,
     "grade_id": "cell-755b0b9277910870",
     "locked": true,
     "schema_version": 3,
     "solution": false,
     "task": false
    }
   },
   "outputs": [],
   "source": [
    "env = gym.make('Pendulum-v1', render_mode=\"human\")\n",
    "state, _ = env.reset()\n",
    "for _ in range(5):\n",
    "    disc_action = np.random.choice(range(9))\n",
    "    cont_action = continualize_action(disc_action)\n",
    "    print(\"discrete action: {}, continuous action: {}\".format(disc_action, cont_action))\n",
    "    \n",
    "    state, reward, terminated, truncated, _ = env.step(cont_action) # take a random action\n",
    "    done = terminated or truncated\n",
    "    disc_state = discretize_state(state)\n",
    "    print(\"discrete state: {}, continuous state: {}\".format(disc_state, state))\n",
    "    \n",
    "env.close()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbgrader": {
     "grade": true,
     "grade_id": "cell-7050729ab9b288bc",
     "locked": false,
     "points": 0,
     "schema_version": 3,
     "solution": true,
     "task": false
    }
   },
   "source": [
    "## Discretization Diagrams"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "nbgrader": {
     "grade": true,
     "grade_id": "cell-191ecd76d4787fb5",
     "locked": false,
     "points": 0,
     "schema_version": 3,
     "solution": true,
     "task": false
    }
   },
   "outputs": [],
   "source": [
    "d_T = 15\n",
    "\n",
    "cos_theta = np.linspace(-1, 1, 10000)\n",
    "sin_theta = np.linspace(-1, 1, 10000)\n",
    "omega = np.linspace(-8, 8, 10000)\n",
    "T = np.arange(0, d_T, 1)\n",
    "\n",
    "disc_states = np.array([discretize_state(np.array([c, s, o])) for c, s, o in zip(cos_theta, sin_theta, omega)])\n",
    "cont_actions = [continualize_action(np.array(t)) for t in T]\n",
    "\n",
    "plt.plot(cos_theta, disc_states[:, 0])\n",
    "plt.xlabel(r\"cos$(\\theta)$, sin$(\\theta)$\")\n",
    "plt.ylabel(r\"cos${}_d$, sin${}_d$\")\n",
    "plt.grid('major')\n",
    "plt.show()\n",
    "plt.plot(omega, disc_states[:, 2])\n",
    "plt.xlabel(r\"$\\omega / \\frac{1}{\\mathrm{s}}$\")\n",
    "plt.ylabel(r\"$\\omega_d$\")\n",
    "plt.grid('major')\n",
    "plt.show()\n",
    "plt.plot(T, cont_actions, 'o-')\n",
    "plt.xlabel(r\"$T_d$\")\n",
    "plt.ylabel(r\"$T / \\mathrm{N} \\cdot \\mathrm{m}$\")\n",
    "plt.grid('major')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbgrader": {
     "grade": false,
     "grade_id": "cell-2575b2d2065717a7",
     "locked": true,
     "schema_version": 3,
     "solution": false,
     "task": false
    }
   },
   "source": [
    "## 1) n-Step Sarsa"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbgrader": {
     "grade": false,
     "grade_id": "cell-71c349849a7bdad7",
     "locked": true,
     "schema_version": 3,
     "solution": false,
     "task": false
    }
   },
   "source": [
    "Write an on-policy n-step Sarsa control algorithm for the inverted pendulum. \n",
    "\n",
    "Use the following parameters: $\\alpha=0.1, \\gamma=0.9, \\varepsilon=0.1, n=10$ with 500 time steps in 2000 episodes.\n",
    "\n",
    "![](nStepSARSA_Algo.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbgrader": {
     "grade": true,
     "grade_id": "cell-877e2e0ac6a7510e",
     "locked": false,
     "points": 0,
     "schema_version": 3,
     "solution": true,
     "task": false
    }
   },
   "source": [
    "## Solution 1)\n",
    "\n",
    "Execution might take long due to the \"render\" command, but this allows to observe the learning. Comment out to execute faster."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def policy(pi, state, deterministic, epsilon):\n",
    "    \"\"\"Decide on an action given the current state.\n",
    "\n",
    "    Args:\n",
    "        pi: The policy\n",
    "        state: The current state of the environment\n",
    "        epsilon: Probability for random action in eps-greedy\n",
    "\n",
    "    Returns:\n",
    "        action: The chosen action\n",
    "    \"\"\"\n",
    "    # YOUR CODE HERE\n",
    "    raise NotImplementedError()\n",
    "    return action\n",
    "    \n",
    "\n",
    "def interact(pi, action, done, states, rewards, actions, k, k_T, epsilon):\n",
    "    \"\"\"Interact with the environment to get to the next state. Note that\n",
    "    the interaction is a different here, as this is a SARSA based algorithm.\n",
    "    The action has already been decided on in the last step.\n",
    "\n",
    "    Args:\n",
    "        pi: The current policy\n",
    "        action: The chosen action\n",
    "        done: Whether the current episode is truncated/terminated\n",
    "        states: A list of the states visited in the current episode\n",
    "        rewards: A list of rewards gathered in the current episode\n",
    "        actions: A list of actions applied in the current episode\n",
    "        k: The current time index of the episode\n",
    "        k_T: The termination time (initialized at +inf and set later to be able to deal with\n",
    "            cases with variable episode lengths, e.g. the race track from previous exercises)\n",
    "        epsilon: Probability for random action in eps-greedy\n",
    "\n",
    "    Returns:\n",
    "        next_state: The state after the interaction\n",
    "        next_action: The action for the next timestep\n",
    "        done: Whether the current episode is truncated/terminated in the next state\n",
    "        states: A list of the states visited in the current episode including the new state\n",
    "        rewards: A list of rewards gathered in the current episode including the new reward\n",
    "        actions: A list of actions applied in the current episode including the new action\n",
    "        k_T: The termination time, possibly updated\n",
    "    \"\"\"\n",
    "    # YOUR CODE HERE\n",
    "    raise NotImplementedError()\n",
    "    return next_state, next_action, done, states, rewards, actions, k_T\n",
    "    \n",
    "\n",
    "def learn(action_values, pi, states, actions, rewards, done, k, n, tau, discount_array, gamma, alpha, k_T):\n",
    "    \"\"\"Learn from your gathered data using n-step SARSA.\n",
    "\n",
    "    Args:\n",
    "        action_values: The action-values before the update\n",
    "        pi: The policy before the update\n",
    "        states: A list of the states visited in the current episode\n",
    "        actions: A list of actions applied in the current episode\n",
    "        rewards: A list of rewards gathered in the current episode\n",
    "        done: Whether the current episode is truncated/terminated\n",
    "        k: The current time index of the episode\n",
    "        n: The number of learning steps\n",
    "        tau: The time index for updating the estimate (lags behind the time index of the episode)\n",
    "        discount_array: An array to weight the n learning steps with the discount factor\n",
    "        gamma: The discount factor\n",
    "        alpha: The step size / learning rate\n",
    "\n",
    "    Returns:\n",
    "        action_values: The updated action values\n",
    "        pi: The updated policy\n",
    "    \"\"\"\n",
    "    # YOUR CODE HERE\n",
    "    raise NotImplementedError()\n",
    "    return action_values, pi"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "nbgrader": {
     "grade": false,
     "grade_id": "cell-37c9e8a6c5268048",
     "locked": false,
     "schema_version": 3,
     "solution": true,
     "task": false
    }
   },
   "outputs": [],
   "source": [
    "env = gym.make('Pendulum-v1') # , render_mode=\"human\"\n",
    "env = env.unwrapped\n",
    "\n",
    "nb_episodes = 2000  # number of episodes\n",
    "nb_steps = 500  # length of episodes\n",
    "env = TimeLimit(env, max_episode_steps=nb_steps)  # wrap the env for the new maximum step size\n",
    "\n",
    "alpha = 0.1  # learning rate\n",
    "gamma = 0.9  # discount factor\n",
    "epsilon = 0.1  # epsilon greedy parameter\n",
    "n = 10  # prediction steps\n",
    "\n",
    "action_values = np.zeros([d_theta, d_theta, d_omega, d_T])\n",
    "# int is necessary for indexing\n",
    "pi = np.zeros([d_theta, d_theta, d_omega], dtype=int)\n",
    "\n",
    "# we can use this to figure out how well the learning worked\n",
    "cumulative_reward_history = []\n",
    "\n",
    "for j in tqdm(range(nb_episodes), position=0, leave=True):\n",
    "\n",
    "    states = []\n",
    "    actions = []\n",
    "    rewards = []\n",
    "\n",
    "    # will be multiplied with the last rewards\n",
    "    discount_array = gamma ** np.arange(n)\n",
    "\n",
    "    cont_state, _ = env.reset()  # initialize x_0\n",
    "    done = False\n",
    "\n",
    "    state = tuple(discretize_state(cont_state))  # use tuple indexing\n",
    "    action = pi[state]\n",
    "\n",
    "    states.append(state)\n",
    "    actions.append(tuple([action]))\n",
    "\n",
    "    # terminal time\n",
    "    k_T = np.inf\n",
    "    k = 0\n",
    "\n",
    "    while True:\n",
    "        # YOUR CODE HERE\n",
    "        raise NotImplementedError()\n",
    "\n",
    "        k += 1\n",
    "\n",
    "    cumulative_reward_history.append(np.sum(rewards))\n",
    "\n",
    "env.close()\n",
    "pi_learned = np.copy(pi)  # save pi in cache under different name for later"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plt.plot(cumulative_reward_history)\n",
    "plt.xlabel(\"episode\")\n",
    "plt.ylabel(r\"$\\sum R$\")\n",
    "plt.show()\n",
    "\n",
    "print(np.shape(cumulative_reward_history))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbgrader": {
     "grade": false,
     "grade_id": "cell-ddebe8848a817b91",
     "locked": true,
     "schema_version": 3,
     "solution": false,
     "task": false
    }
   },
   "source": [
    "## Greedy Execution\n",
    "\n",
    "Test the learned policy by pure greedy execution."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "nbgrader": {
     "grade": false,
     "grade_id": "cell-6ffa29bb63e9fc42",
     "locked": true,
     "schema_version": 3,
     "solution": false,
     "task": false
    }
   },
   "outputs": [],
   "source": [
    "env = gym.make('Pendulum-v1', render_mode=\"human\")\n",
    "env = env.unwrapped\n",
    "\n",
    "nb_steps = 200\n",
    "env = TimeLimit(env, max_episode_steps=nb_steps)\n",
    "\n",
    "\n",
    "state, _ = env.reset() # initialize x_0\n",
    "disc_state = tuple(discretize_state(state)) # use tuple indexing\n",
    "disc_action = pi_learned[disc_state]\n",
    "\n",
    "while True:\n",
    "        \n",
    "    cont_action = continualize_action(disc_action)\n",
    "    env.render() # comment out for faster execution\n",
    "    state, reward, terminated, truncated, _ = env.step(cont_action)\n",
    "    done = terminated or truncated\n",
    "\n",
    "    disc_state = tuple(discretize_state(state))\n",
    "        \n",
    "    if done:\n",
    "        break\n",
    "\n",
    "    disc_action = pi_learned[disc_state] # exploitative action\n",
    "    \n",
    "env.close()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbgrader": {
     "grade": false,
     "grade_id": "cell-4731b891d975389f",
     "locked": true,
     "schema_version": 3,
     "solution": false,
     "task": false
    },
    "tags": []
   },
   "source": [
    "## 2) Recursive updates"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbgrader": {
     "grade": false,
     "grade_id": "cell-24a77a257e20058a",
     "locked": true,
     "schema_version": 3,
     "solution": false,
     "task": false
    },
    "tags": []
   },
   "source": [
    "Both, $n$-step and $\\lambda$-return updates, are based on a forward view. That means we have to wait for future states and rewards before an update can be performed.\n",
    "We therefore introduced the concept of eligibility traces, which follows the general idea that previous actions have significantly led to the current situation. Contrary to n-step learning, however, intuition tells us that more recent decisions had a more severe impact on the present situation than decisions that were made a long time ago. Thus, it may be helpful to integrate a forgetting factor $\\lambda$ which decreases the assumed influence of actions over time.\n",
    "\n",
    "Solution 2 is now to be changed by using eligibility traces $z_k(x_k)$ within the action-value update, resulting in SARSA($\\lambda$). Note that the code is overall a lot less complex than the code from task 2 and much closer to the \"normal\" TD(0)-SARSA.\n",
    "\n",
    "Test it for different values of $\\lambda$. How sensitive is the process to the choice of $\\lambda$?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def interact(pi, action, deterministic, epsilon):\n",
    "    \"\"\"Interact with the environment to get to the next state. Note that\n",
    "    the interaction is a different here, as this is a SARSA based algorithm.\n",
    "    The action has already been decided on in the last step.\n",
    "\n",
    "    Args:\n",
    "        pi: The current policy\n",
    "        action: The chosen action\n",
    "\n",
    "    Returns:\n",
    "        next_state: The state after the interaction\n",
    "        reward: The reward for the current interaction\n",
    "        next_action: The action for the next timestep\n",
    "        done: Whether the current episode is truncated/terminated in the next state\n",
    "    \"\"\"\n",
    "    # YOUR CODE HERE\n",
    "    raise NotImplementedError()\n",
    "    return next_state, reward, next_action, done\n",
    "\n",
    "\n",
    "def learn(action_values, pi, state, next_state, reward, action, next_action, eligibility, lambd, gamma, alpha):\n",
    "    \"\"\"Learn from your gathered data using SARSA(lambda). Note that the structures to deal with \n",
    "    the terminal state from the last task are not needed in this algorithm. The information from\n",
    "    the past is saved in the eligibility trace.\n",
    "\n",
    "    Args:\n",
    "        action_values: The action-values before the update\n",
    "        pi: The policy before the update\n",
    "        state: State before the last interaction\n",
    "        next_state: State after the last interaction\n",
    "        reward: Reward for the interaction\n",
    "        action: Action in the last interaction\n",
    "        next_action: Action for the coming interaction\n",
    "        eligibility: The eligibility trace\n",
    "        lambd: Decay factor\n",
    "        gamma: Discount factor\n",
    "        alpha: Step size / learning rate\n",
    "\n",
    "    Returns:\n",
    "        action_values: The updated action values\n",
    "        pi: The updated policy\n",
    "    \"\"\"\n",
    "    # YOUR CODE HERE\n",
    "    raise NotImplementedError()\n",
    "    return action_values, pi"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "nbgrader": {
     "grade": false,
     "grade_id": "cell-8872e02f8c41136a",
     "locked": false,
     "schema_version": 3,
     "solution": true,
     "task": false
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "env = gym.make('Pendulum-v1') # , render_mode=\"human\"\n",
    "env = env.unwrapped\n",
    "\n",
    "nb_episodes = 2000  # number of episodes\n",
    "nb_steps = 500  # length of episodes\n",
    "env = TimeLimit(env, max_episode_steps=nb_steps)\n",
    "\n",
    "alpha = 0.1  # learning rate\n",
    "gamma = 0.9  # discount factor\n",
    "epsilon = 0.1  # epsilon greedy parameter\n",
    "lambd = 0.8  # forgetting factor\n",
    "\n",
    "action_values = np.zeros([d_theta, d_theta, d_omega, d_T])\n",
    "# int is necessary for indexing\n",
    "pi = np.zeros([d_theta, d_theta, d_omega], dtype=int)\n",
    "\n",
    "# we can use this to figure out how well the learning worked\n",
    "cumulative_reward_history = []\n",
    "\n",
    "for j in tqdm(range(nb_episodes), position=0, leave=True):\n",
    "\n",
    "    # init eligibility trace, acts as the memory of this algorithm\n",
    "    eligibility = np.zeros([d_theta, d_theta, d_omega, d_T])\n",
    "\n",
    "    cont_state, _ = env.reset()  # initialize x_0\n",
    "    done = False\n",
    "\n",
    "    state = tuple(discretize_state(cont_state))  # use tuple indexing\n",
    "    action = pi[state]\n",
    "\n",
    "    # update eligibilities\n",
    "    eligibility *= lambd * gamma\n",
    "    eligibility[state, (action,)] += 1.0\n",
    "\n",
    "    # terminal time\n",
    "    k_T = None\n",
    "    k = 0\n",
    "\n",
    "    rewards = 0\n",
    "\n",
    "    while True:\n",
    "        # YOUR CODE HERE\n",
    "        raise NotImplementedError()\n",
    "\n",
    "        rewards += reward\n",
    "        if done:\n",
    "            break\n",
    "\n",
    "    cumulative_reward_history.append(np.sum(rewards))\n",
    "\n",
    "env.close()\n",
    "\n",
    "### END SOLUTION"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "nbgrader": {
     "grade": false,
     "grade_id": "cell-05f58d9e3aabc348",
     "locked": true,
     "schema_version": 3,
     "solution": false,
     "task": false
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "plt.plot(cumulative_reward_history)\n",
    "plt.xlabel(\"episode\")\n",
    "plt.ylabel(r\"$\\sum R$\")\n",
    "plt.show()\n",
    "\n",
    "print(np.shape(cumulative_reward_history))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbgrader": {
     "grade": false,
     "grade_id": "cell-ea25c02b87a6e9b6",
     "locked": true,
     "schema_version": 3,
     "solution": false,
     "task": false
    },
    "tags": []
   },
   "source": [
    "## Greedy Execution\n",
    "\n",
    "Test the learned policy by pure greedy execution."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "nbgrader": {
     "grade": false,
     "grade_id": "cell-7272b1acfbd4b325",
     "locked": true,
     "schema_version": 3,
     "solution": false,
     "task": false
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "env = gym.make('Pendulum-v1', render_mode=\"human\")\n",
    "env = env.unwrapped\n",
    "\n",
    "nb_steps = 200\n",
    "\n",
    "state, _ = env.reset() # initialize x_0\n",
    "disc_state = tuple(discretize_state(state)) # use tuple indexing\n",
    "disc_action = pi[disc_state]\n",
    "\n",
    "for k in range(nb_steps):\n",
    "        \n",
    "    next_state, reward, next_action, done = interact(pi, action, True, epsilon)\n",
    "\n",
    "    state = next_state\n",
    "    action = next_action\n",
    "    if done:\n",
    "        break\n",
    "    \n",
    "env.close()"
   ]
  }
 ],
 "metadata": {
  "@webio": {
   "lastCommId": null,
   "lastKernelId": null
  },
  "celltoolbar": "Create Assignment",
  "kernelspec": {
   "display_name": "RL25",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
